NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Tolerant Algorithms for Learning with Arbitrary Covariate Shift

Goel, Surbhi; Shetty, Abhishek; Stavropoulos, Konstantinos; Vasilyan, Arsen (December 2024, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems (NeurIPS 2024))

We study the problem of learning under arbitrary distribution shift, where the learner is trained on a labeled set from one distribution but evaluated on a different, potentially adversarially generated test distribution. We focus on two frameworks: PQ learning [GKKM'20], allowing abstention on adversarially generated parts of the test distribution, and TDS learning [KSV'23], permitting abstention on the entire test distribution if distribution shift is detected. All prior known algorithms either rely on learning primitives that are computationally hard even for simple function classes, or end up abstaining entirely even in the presence of a tiny amount of distribution shift. We address both these challenges for natural function classes, including intersections of halfspaces and decision trees, and standard training distributions, including Gaussians. For PQ learning, we give efficient learning algorithms, while for TDS learning, our algorithms can tolerate moderate amounts of distribution shift. At the core of our approach is an improved analysis of spectral outlier-removal techniques from learning with nasty noise. Our analysis can (1) handle arbitrarily large fraction of outliers, which is crucial for handling arbitrary distribution shifts, and (2) obtain stronger bounds on polynomial moments of the distribution after outlier removal, yielding new insights into polynomial regression under distribution shifts. Lastly, our techniques lead to novel results for tolerant testable learning [RV'23], and learning with nasty noise.
more » « less
Full Text Available
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

Edelman, Benjamin L; Edelman, Ezra; Goel, Surbhi; Malach, Eran; Tsilivis, Nikolaos (December 2024, 38th Conference on Neural Information Processing Systems (NeurIPS 2024))

Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov Chain sequence modeling task in order to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form \emph{statistical induction heads} which compute accurate next-token probabilities given the bigram statistics of the context. During the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to sub-optimally predict using in-context single-token statistics (unigrams); then, there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution. We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to n-grams for n is greater than 2.
more » « less
Full Text Available
Tolerant Algorithms for Learning with Arbitrary Covariate Shift

Goel, Surbhi; Shetty, Abhishek; Stavropoulos, Konstantinos; Vasilyan, Arsen (June 2024, https://doi.org/10.48550/arXiv.2406.02742)

This paper addresses the challenge of learning under arbitrary distribution shift, where models are trained on data from one distribution but evaluated on a potentially adversarially generated test distribution. The authors study two frameworks: PQ learning [Goldwasser, A. Kalai, Y. Kalai, Montasser NeurIPS 2020], which allows abstention on adversarial portions of the test distribution, and TDS learning [Klivans, Stavropoulos, Vasilyan COLT 2024], which permits abstention on the entire test set if a distribution shift is detected. Previous algorithms either relied on computationally hard primitives (even for simple classes) or abstained entirely with even small shifts. This work develops efficient algorithms for natural function classes such as intersections of halfspaces and decision trees, under standard training distributions like Gaussians. The key innovation is an improved analysis of spectral outlier-removal techniques from learning with nasty noise, enabling: (1) handling arbitrarily large fractions of outliers, crucial for arbitrary shifts, and (2) stronger bounds on polynomial moments post outlier-removal, yielding new insights into polynomial regression under distribution shift. The techniques also produce novel results for tolerant testable learning [Rubinfeld and Vasilyan STOC 2023] and learning with nasty noise.
more » « less
Full Text Available
Learning Narrow One-Hidden-Layer ReLU Networks

Chen, Sitan; Dou, Zehao; Goel, Surbhi; Klivans, Adam; Meka, Raghu (July 2023, Conference on Learning Theory)

We consider the well-studied problem of learning a linear combination of k ReLU activations with respect to a Gaussian distribution on inputs in d dimensions. We give the first polynomial-time algorithm that succeeds whenever k is a constant. All prior polynomial-time learners require additional assumptions on the network, such as positive combining coefficients or the matrix of hidden weight vectors being well-conditioned. Our approach is based on analyzing random contractions of higher-order moment tensors. We use a multi-scale analysis to argue that sufficiently close neurons can be collapsed together, sidestepping the conditioning issues present in prior work. This allows us to design an iterative procedure to discover individual neurons.
more » « less
Full Text Available
Learning Narrow One-Hidden-Layer ReLU Networks

Chen, Sitan; Dou, Zehao; Goel, Surbhi; Klivans, Adam; Meka, Raghu (July 2023, The Thirty Sixth Annual Conference on Learning Theory (COLT 2023))

Full Text Available
Understanding Contrastive Learning Requires Incorporating Inductive Biases

Saunshi, Nikunj; Ash, Jordan; Goel, Surbhi; Misra, Dipendra; Zhang, Cyril; Arora, Sanjeev; Kakade, Sham; Krishnamurthy, Akshay (January 2022, Proceedings of Machine Learning Research)

Full Text Available
Tight Hardness Results for Training Depth-2 ReLU Networks

Goel, Surbhi; Klivans, Adam; Manurangsi, Pasin; Reichman, Daniel (January 2021, ITCS 2021)

We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k>1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ϵ. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well known Densest k-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in (1/epsilon)^2. Together with a previous work regarding improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights giving new (worst-case) upper bounds on the running time needed to learn such networks both in the realizable and agnostic settings. Our upper bounds on the running time essentially matches our lower bounds in terms of the dependency on epsilon.
more » « less
Full Text Available
Gone Fishing: Neural Active Learning with Fisher Embeddings

Ash, Jordan; Goel, Surbhi; Krishnamurthy, Akshay; Kakade, Sham (January 2021, Advances in neural information processing systems)

Full Text Available
Efficiently Learning Adversarially Robust Halfspaces with Noise

Montasser Omar, Goel Surbhi (January 2020, Proceedings of the 37th International Conference on Machine Learning)

Full Text Available
Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent

Goel, Surbhi; Gollakota, Aravind; Jin, Zhihan; Karmalkar, Sushrut; Klivans, Adam (July 2020, International Conference on Machine Learning)

We prove the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution using gradient descent. We show that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network. For classification, we give a stronger result, namely that any statistical query (SQ) algorithm (including gradient descent) will fail to achieve small test error in polynomial time. Prior work held only for gradient descent run with small batch sizes, required sharp activations, and applied to specific classes of queries. Our lower bounds hold for broad classes of activations including ReLU and sigmoid. The core of our result relies on a novel construction of a simple family of neural networks that are exactly orthogonal with respect to all spherically symmetric distributions.
more » « less
Full Text Available

« Prev Next »

Search for: All records